Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining

نویسندگان

  • Kristina M. Hettne
  • Antony J. Williams
  • Erik M. van Mulligen
  • Jos Kleinjans
  • Valery Tkachenko
  • Jan A. Kors
چکیده

BACKGROUND Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated which impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships. RESULTS We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of ca. 80 k names was only a 1/3 to a 1/4 the size of Chemlist at around 300 k. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation. CONCLUSIONS We conclude the following: (1) The ChemSpider dictionary achieved the best precision but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation is necessary to achieve a high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

One Tagger, Many Uses: Illustrating the Power of Ontologies in Dictionary-based Named Entity Recognition

Automatic annotation of text is an important complement to manual annotation, because the latter is highly labour intensive. We have developed a fast dictionary-based named entity recognition (NER) system and addressed a wide variety of biomedical problems by applied it to text from many different sources. We have used this tagger both in real-time tools to support curation efforts and in pipel...

متن کامل

Large-scale cross-media analysis and mining from socially curated contents

The major interest of the current social network service (SNS) developers and users are rapidly shifting from conventional text-based (micro)blogs such as Twitter and Facebook to multimedia contents such as Flickr, Snapchat, MySpace and Tumblr. However, the ability to analyze and exploit unorganized multimedia contents on those services still remain inadequate, even with state-of-the-art media ...

متن کامل

The Markyt visualisation, prediction and benchmark platform for chemical and gene entity recognition at BioCreative/CHEMDNER challenge

Biomedical text mining methods and technologies have improved significantly in the last decade. Considerable efforts have been invested in understanding the main challenges of biomedical literature retrieval and extraction and proposing solutions to problems of practical interest. Most notably, community-oriented initiatives such as the BioCreative challenge have enabled controlled environments...

متن کامل

StreptomeDB: a resource for natural compounds isolated from Streptomyces species

Bacteria from the genus Streptomyces are very important for the production of natural bioactive compounds such as antibiotic, antitumour or immunosuppressant drugs. Around two-thirds of all known natural antibiotics are produced by these bacteria. An enormous quantity of crucial data related to this genus has been generated and published, but so far no freely available and comprehensive databas...

متن کامل

Argo: an integrative, interactive, text mining-based workbench supporting curation

Curation of biomedical literature is often supported by the automatic analysis of textual content that generally involves a sequence of individual processing components. Text mining (TM) has been used to enhance the process of manual biocuration, but has been focused on specific databases and tasks rather than an environment integrating TM tools into the curation pipeline, catering for a variet...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره 2  شماره 

صفحات  -

تاریخ انتشار 2010